Alternatives To Make Text Searchable
Let’s explore third-party search engines and the inverted index in this lesson.
There are some alternatives available if we don’t want to use the databases’ built-in options: we can adopt a third-party search engine, or we can roll our own database-independent solution. Both are discussed in detail in this lesson.
Third-party search engines
If we need to search text in a way that works the same way regardless of which database brand we use, we need a search engine that runs independently of the SQL database. This section briefly describes two such products, Sphinx Search and Apache Lucene.
Sphinx Search
Sphinx Search is an open-source search engine technology that integrates well with MySQL and PostgreSQL. As of the time this course was written, an unofficial patch existed for using Sphinx Search with the open-source Firebird database. Perhaps in the future, this search engine will support other databases.
Indexing and searching are fast in Sphinx Search, and it supports distributed queries as well. It’s a good choice for high-scale search applications whose data updates infrequently.
We can use Sphinx Search to index data stored in a MySQL database. By modifying a few fields in a configuration file, `sphinx.conf`, we can specify the database. We must also write an SQL query to fetch the data for building the index. The first column in this query is the integer primary key. We may declare some columns as attributes for restricting or sorting results. The remaining columns are those to be full-text indexed. Finally, another SQL query fetches a full row from the database, given a primary key value coded as `$id`.
```
source bugsrc
{
  type                 = mysql
  sql_user             = bugsuser
  sql_pass             = xyzzy
  sql_db               = bugsdatabase
  sql_query            = \
    SELECT bug_id, status, date_reported, summary, description \
    FROM Bugs
  sql_attr_timestamp   = date_reported
  sql_attr_str2ordinal = status
  sql_query_info       = SELECT * FROM Bugs WHERE bug_id = $id
}

index bugs
{
  source = bugsrc
  path   = /opt/local/var/db/sphinx/bugs
}
```
Once we declare this configuration in `sphinx.conf`, we can create the index at the shell with the `indexer` command:
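For example, assuming the configuration above is saved as `sphinx.conf` and using the index name `bugs` it declares, a typical invocation looks like this (paths and options vary by installation):

```
indexer -c sphinx.conf bugs
```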
We can search the index using the `search` command:
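A minimal example, searching the index for the word “crash” (the keyword here is just an illustration, and the exact options depend on your installed version):

```
search -c sphinx.conf -b "crash"
```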
Sphinx Search also has a daemon process and an API with which to invoke searches from popular scripting languages such as PHP, Perl, and Ruby. The major disadvantage of this software is that the index algorithm doesn’t support incremental updates efficiently. Using Sphinx Search over a data source that updates frequently requires some compromises. For example, we may need to split our searchable table into two tables: the first to store the majority of historical data that doesn’t change and the second to store a smaller set of current data that grows incrementally and must be reindexed. Then our application must search two Sphinx Search indexes.
Apache Lucene#
Lucene is a mature search engine for Java applications. Work-alike projects exist for other languages, including C++, C#, Perl, Python, Ruby, and PHP.
Lucene builds an index in its proprietary format for a collection of text documents. The Lucene index doesn’t stay in sync with the source data it indexes. If we insert, delete, or update rows in the database, we must apply matching changes to a Lucene index.
Using the Lucene search engine is a bit like using a car engine; we need quite a bit of supporting technology around it to make it useful. Lucene doesn’t read data collections from an SQL database directly. Instead, we have to write documents into the Lucene index ourselves. For example, we could run an SQL query, create one Lucene document for each row of the result, and save it to the Lucene index. We can use Lucene through its Java API.
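To make this concrete, here is a rough sketch (not code from this course) of indexing one `Bugs` row through the Lucene Java API; the index path, field names, and hard-coded row values are assumptions for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class BugIndexer {
  public static void main(String[] args) throws Exception {
    // Open (or create) a Lucene index on disk; the path is an assumption.
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/lucene-bugs")),
        new IndexWriterConfig(new StandardAnalyzer()))) {

      // In a real application these values would come from an SQL query
      // over the Bugs table; they are hard-coded here to keep the sketch runnable.
      Document doc = new Document();
      doc.add(new StoredField("bug_id", 1234L));
      doc.add(new TextField("summary", "crash when saving", Field.Store.YES));
      doc.add(new TextField("description", "The application crashes on save.", Field.Store.NO));
      writer.addDocument(doc);
    }
  }
}
```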
Fortunately, Apache also offers a complementary project called Solr. Solr is a server that provides a gateway to a Lucene index. We can add documents to Solr and submit search queries using a REST-like interface so we can use it from any programming language.
We can also direct Solr to connect to a database itself, run a query, and index the results using its `DataImportHandler` tool.
Roll your own
Let’s suppose we don’t want to use proprietary search features, nor install an independent search engine product, but we still need an efficient, database-independent way to make text searchable. In this section, we design what’s called an inverted index.
An inverted index is a list of all the words one might search for. In a many-to-many relationship, the index associates these words with the text entries that contain them. For example, a word like “crash” can appear in many bugs, and each bug may match many other keywords.
First, we define a table `Keywords` to list the keywords for which users will search, and an intersection table `BugsKeywords` to establish the many-to-many relationship:
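A minimal sketch of these two tables in MySQL syntax follows. The column names mirror the examples in this lesson, but the exact types and keyword length are assumptions, and the `Bugs` table with its `bug_id` primary key is taken as given:

```sql
CREATE TABLE Keywords (
  keyword_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  keyword     VARCHAR(40) NOT NULL,
  UNIQUE KEY (keyword)              -- each keyword is stored only once
);

CREATE TABLE BugsKeywords (
  keyword_id  BIGINT UNSIGNED NOT NULL,
  bug_id      BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (keyword_id, bug_id),
  FOREIGN KEY (keyword_id) REFERENCES Keywords(keyword_id),
  FOREIGN KEY (bug_id)     REFERENCES Bugs(bug_id)
);
```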
Next, let’s add a row to `BugsKeywords` for every keyword that matches the description text for a given bug. We can use a substring-match query to determine these matches using `LIKE` or regular expressions. This is more costly than the naive searching method described in the Antipattern lesson, but we gain efficiency because we only need to perform the search once. If we save the result in the intersection table, all subsequent searches for the same keyword will be much faster.
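For instance, a population pass for a single keyword might look like the following sketch, which assumes the tables above and matches against the `description` column only:

```sql
-- Link every bug whose description mentions 'crash' to that keyword
INSERT INTO BugsKeywords (keyword_id, bug_id)
  SELECT k.keyword_id, b.bug_id
  FROM Keywords k
  JOIN Bugs b ON b.description LIKE CONCAT('%', k.keyword, '%')
  WHERE k.keyword = 'crash';
```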
Next, we write a stored procedure to make it easier to search for a given keyword. If the word has already been searched, the query is faster because the rows in `BugsKeywords` list all the documents that contain the keyword. If no one has searched for the given keyword before, we have to collect the matching text entries the hard way.
Let’s try the following steps to write the procedure in MySQL (a sketch of the full procedure follows the list):

1. Search for the user-specified keyword. The code returns either the integer primary key from `Keywords.keyword_id` or null if the word has not been seen previously.
2. If the word was not found, insert it as a new word.
3. Query for the primary key value generated in `Keywords`.
4. Populate the intersection table by searching `Bugs` for rows containing the new keyword.
5. Finally, query the full rows from `Bugs` that match the `keyword_id`, whether the keyword was found or had to be inserted as a new entry.
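Here is one way the procedure could look; this is a sketch built on the tables defined above, not the only possible implementation. `MAX()` is used in step 1 so the lookup always returns exactly one row:

```sql
DELIMITER //

CREATE PROCEDURE BugsSearch(IN p_keyword VARCHAR(40))
BEGIN
  DECLARE v_keyword_id BIGINT UNSIGNED;

  -- Step 1: look up the keyword; MAX() yields NULL if it has never been searched
  SELECT MAX(keyword_id) INTO v_keyword_id
  FROM Keywords
  WHERE keyword = p_keyword;

  IF v_keyword_id IS NULL THEN
    -- Steps 2 and 3: insert the new keyword and capture its generated key
    INSERT INTO Keywords (keyword) VALUES (p_keyword);
    SET v_keyword_id = LAST_INSERT_ID();

    -- Step 4: populate the intersection table with every matching bug
    INSERT INTO BugsKeywords (keyword_id, bug_id)
      SELECT v_keyword_id, bug_id
      FROM Bugs
      WHERE description LIKE CONCAT('%', p_keyword, '%');
  END IF;

  -- Step 5: return the full rows for all bugs matching this keyword
  SELECT b.*
  FROM Bugs b
  JOIN BugsKeywords bk USING (bug_id)
  WHERE bk.keyword_id = v_keyword_id;
END//

DELIMITER ;
```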
Now we can call this stored procedure and pass the desired keyword. The procedure returns the set of matching bugs, whether it has to calculate the matching bugs and populate the intersection table for a new keyword or whether it simply benefits from the results of an earlier search.
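For example, using the sketch above:

```sql
CALL BugsSearch('crash');
```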
We can do more with this solution: we can define a trigger to populate the intersection table as each new bug is inserted. If we need to support edits to bug descriptions, we may also have to write a trigger to reanalyze the text and add or delete rows in the `BugsKeywords` table.
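A sketch of such an insert trigger, again assuming the tables above and matching on the `description` column only:

```sql
CREATE TRIGGER Bugs_Insert
AFTER INSERT ON Bugs
FOR EACH ROW
  -- Link the new bug to every known keyword its description contains
  INSERT INTO BugsKeywords (keyword_id, bug_id)
    SELECT keyword_id, NEW.bug_id
    FROM Keywords
    WHERE NEW.description LIKE CONCAT('%', keyword, '%');
```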
The keyword list is populated naturally as users perform searches, so we don’t need to fill the keyword list with every word found in the knowledge base articles. On the other hand, if we can anticipate likely keywords, we can easily run a search for them, thus bearing the initial cost of searching for each keyword so that it doesn’t fall on our users.
Returning to the challenge I described at the beginning of this chapter, I used an inverted index for my knowledge base application as the solution. I also enhanced the `Keywords` table with an additional column, `num_searches`, which I incremented each time a user searched for a given keyword so I could track which searches were most in demand.